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Abstract 

This paper considers the problem of clustering a collection of unlabeled data points assumed 
to lie near a union of lower dimensional planes. As is common in computer vision or unsuper- 
vised learning applications, we do not know in advance how many subspaces there are nor do 
we have any information about their dimensions. We develop a novel geometric analysis of an 
algorithm named sparse subspace clustering (SSC) [11], which significantly broadens the range 
of problems where it is provably effective. For instance, we show that SSC can recover multiple 
subspaces, each of dimension comparable to the ambient dimension. We also prove that SSC can 
correctly cluster data points even when the subspaces of interest intersect. Further, we develop 
an extension of SSC that succeeds when the data set is corrupted with possibly overwhelm- 
ingly many outliers. Underlying our analysis are clear geometric insights, which may bear on 
other sparse recovery problems. A numerical study complements our theoretical analysis and 
demonstrates the effectiveness of these methods. 

Keywords. Subspace clustering, spectral clustering, outlier detection, ii minimization, duality in linear 
programming, geometric functional analysis, properties of convex bodies, concentration of measure. 

1 Introduction 

1.1 Motivation 

One of the most fundamental steps in data analysis and dimensionality reduction consists of ap- 
proximating a given dataset by a single low-dimensional subspace, which is classically achieved 
via Principal Component Analysis (PCA). In many problems, however, a collection of points may 
not he near a low-dimensional plane but near a union of multiple subspaces as shown in Figure 1. 
It is then of interest to find or ht all these subspaces. Furthermore, because our data points are 
unlabelled in the sense that we do not know in advance to which subspace they belong to, we need 
to simultaneously cluster these data into multiple subspaces and find a low-dimensional subspace 
approximating all the points in a cluster. This problem is known as subspace clustering and has 
numerous applications; we list just a few. 

• Unsupervised learning. In unsupervised learning the goal is to build representations of ma- 
chine inputs, which can be used for decision making, predicting future inputs, efficiently 
communicating the inputs to another machine, and so on. 
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Figure 1 



In some unsupervised learning applications, the standard assumption is that the data is well 
approximated by a union of lower dimensional manifolds. Furthermore, these manifolds are 
sometimes well approximated by subspaces whose dimension is only slightly higher than that 
of the manifold under study. Such an example is handwritten digits. When looking at hand- 
written characters for recognition, the human eye is able to allow for simple transformations 
such as rotations, small scalings, location shifts, and character thickness. Therefore, any 
reasonable model should be insensitive to such changes as well. Simard et. al. [36] char- 
acterize this invariance with a 7-dimensional manifold; that is, different transformations of 
a single digit are well approximated by a 7-dimensional manifold. As illustrated by Hastie 
et. al. [17] these 7-dimensional manifolds are in turn well approximated by 12-dimensional 
subspaces. Thus in certain cases, unsupervised learning can be formulated as a subspace 
clustering problem. 

• Computer vision. There has been an explosion of visual data in the past few years. Cameras 
are now everywhere: street corners, traffic lights, airports and so on. Furthermore, millions 
of videos and images are uploaded monthly on the web. This visual data deluge has mo- 
tivated the development of low-dimensional representations based on appearance, geometry 
and dynamics of a scene. In many such applications, the low-dimensional representations are 
characterized by multiple low-dimensional subspaces. One such example is motion segmen- 
tation [44]. Here, we have a video sequence which consists of multiple moving objects, and 
the goal is to segment the trajectories of the objects. Each trajectory approximately lies in a 
low-dimensional subspace. To understand scene dynamics, one needs to cluster the trajecto- 
ries of points on moving objects based on the subspaces (objects) they belong to, hence the 
need for subspace clustering. 

Other applications of subspace clustering in computer vision include image segmentation [49] , 
face clustering [18], image representation and compression [19], and systems theory [42]. Over 
the years, various methods for subspace clustering have been proposed by researchers working 
in this area. For a comprehensive review and comparison of these algorithms, we refer the 
reader to the tutorial [45] and references therein, [4, 10, 14, 43, 5, 40, 1, 29, 50, 38, 37, 30, 
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34, 48, 47, 51, 16, 11, 12, 27, 9]. 



• Disease detection. In order to detect a class of diseases of a specific kind (e.g. metabolic), 
doctors screen specific factors (e.g. metabolites). For this purpose, various tests (e.g. blood 
tests) are performed on the newborns and the level of those factors are measured. One can 
further construct a newborn-factor level matrix, where each row contains the factor levels of 
a different newborn. That is to say, each newborn is associated with a vector containing the 
values of the factors. Doctors wish to cluster groups of newborns based on the disease they 
suffer from. Usually, each disease causes a correlation between a specific set of factors. Such 
an assumption implies that points corresponding to newborns suffering from a given disease 
lie on a lower dimensional subspace [33]. Therefore, the clustering of newborns based on their 
specific disease together with the identification of the relevant factors associated with each 
disease can be modeled as a subspace clustering problem. 

PCA is perhaps the single most important tool for dimensionality reduction. However, in 
many problems, the data set under study is not well approximated by a linear subspace of lower 
dimension. Instead, as we hope we have made clear, the data often lie near a union of low- 
dimensional subspaces, reflecting the multiple categories or classes a set of observations may belong 
to. Given its relevance in data analysis, we find it surprising that subspace clustering has been 
well studied in the computer science literature but has comparably received little attention from 
the statistical community. This paper begins with a very recent approach to subspace clustering, 
and proposes a framework in which one can develop some useful statistical theory. As we shall see, 
insights from sparse regression analysis in high dimensions — a subject that has been well developed 
in the statistics literature in recent years — inform the subspace clustering problem. 

1.2 Problem formulation 

In this paper, we assume we are given data points that are distributed on a union of unknown 
linear subspaces S*! u 52 u . . . u Sl', that is, there are L subspaces of R" of unknown dimensions 
di,d2, ■ ■ ■ ,dL. More precisely, we have a point set X c R" consisting of N points in R*^, which may 
be partitioned as 



for each i > 1, is a collection of Ni unit-normed vectors chosen from Si. The careful reader will 
notice that we have an extra subset in (1.1) accounting for possible outliers. Unless specified 
otherwise, we assume that this special subset consists of A'^o points chosen independently and 
uniformly at random on the unit sphere. The task is now simply stated. Without any prior 
knowledge about the number of subspaces, their orientation or their dimension, 

1) identify all the outliers, and 

2) segment or assign each data point to a cluster as to recover all the hidden subspaces. 

It is worth emphasizing that our model assumes normalized data vectors; this is not a restrictive 
assumption since one can always normalize inputs before applying any subspace clustering algo- 
rithm. Although we consider linear subspaces, one can extend the methods of this paper to affine 
subspace clustering which will be explained in Section 1.3.1. 

We now turn to methods for achieving these goals. Our focus is on noiseless data and we refer 
the reader to [8] for work concerning subspace recovery from noisy samples. 
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1.3 Methods and contributions 



To introduce our methods, we first consider the case in which there are no outliers before treating 
the more general case. From now on, it will be convenient to arrange the observed data points as 
columns of a matrix X - [xi, . . . ,xn] e R"'*^, where N - Nq + A'^i + . . . + is the total number of 
points. 

1.3.1 Methods 

Subspace clustering has received quite a bit of attention in recent years and in particular, Elhamifar 
and Vidal introduced a clever algorithm based on insights from the compressive sensing literature. 
The key idea of the Sparse Subspace Clustering (SSC) algorithm [11] is to find the sparsest expansion 
of each column Xi of as a linear combination of all the other columns. This makes a lot of sense 
because under some generic conditions, one expects that the sparsest representation of Xi would 
only select vectors from the subspace in which Xi happens to lie in. This motivates Elhamifar and 
Vidal to consider the sequence of optimization problems 

min \\z\\f^_^ subject to Xz - Xi and Zi - 0. (1-2) 

The hope is that whenever Zj t 0, xi and Xj belong to the same subspace. This property is captured 
by the definition below. 

Definition 1.1 (£i Subspace Detection Property) The subspaces {Si}^^^ and points X obey 
the ii subspace detection property if and only if it holds that for all i, the optimal solution to (1.2) 
has nonzero entries only when the corresponding columns of X are in the same subspace as Xi. 

In certain cases the subspace detection property may not hold, i.e. the support of the optimal 
solution to (1.2) may include points from other subspaces. However, it might still be possible 
to detect and construct reliable clusters. A strategy is to arrange the optimal solutions to (1.2) 
as columns of a matrix Z e R^'*^, build an affinity graph G with vertices and weights Wij - 
\Zij\ + \Zji\, construct the normalized Laplacian of G, and use a gap in the distribution of eigenvalues 
of this matrix to estimate the number of subspaces. Using the estimated number of subspaces, 
spectral clustering techniques (e.g. [35, 32]) can be applied to the affinity graph to cluster the data 
points. The main steps of this procedure are summarized in Algorithm 1. This algorithm clusters 
linear subspaces but can also cluster affine subspaces by adding the constraint Z^l = 1 to (1.2). 

1.3.2 Our contributions 

In Section 3 we will review existing conditions involving a restriction on the minimum angle between 
subspaces under which Algorithm 1 is expected to work. The main purpose of this paper is to show 
that Algorithm 1 works in much broader situations. 

• Subspaces with non-trivial intersections. Perhaps unexpectedly, we shall see that our 
results assert that SSC can correctly cluster data points even when our subspaces intersect 
so that the minimum principal angle vanishes. This is a phenomenon which is far from being 
explained by current theory. 
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Algorithm 1 Sparse Subspace Clustering (SSC) 



Input: A data set X arranged as columns oi X ^ R"'*" . 

1. Solve (the optimization variable is the N x N matrix Z) 

minimize ll'^llfj 
subject to XZ - X 

diag(Z) = 0. 

2. Form the affinity graph G with nodes representing the N data points and edge weights given 
hyW^ \Z\ + \zf. 

3. Sort the eigenvalues fii > o"2 > . . . > ctat of the normalized Laplacian of G in descending order, 
and set 

L - N - argmax (dj - cJi+i). 

i=l,...,N-l 

4. Apply a spectral clustering technique to the affinity graph using L as the estimated number 
of clusters. 

Output: Partition Xi, . . . , Xf . 



• Subspaces of nearly linear dimension. We prove that in generic settings, SSC can 
effectively cluster the data even when the dimensions of the subspaces grow almost linearly 
with the ambient dimension. We are not aware of other literature explaining why this should 
be so. To be sure, in most favorable cases, earlier results only seem to allow the dimensions 
of the subspaces to grow at most like the square root of the ambient dimension. 

• Outlier detection. We present modifications to SSC that succeed when the data set is 
corrupted with many outliers — even when their number far exceeds the total number of clean 
observations. To the best of our knowledge, this is the first algorithm provably capable of 
handling these many corruptions. 

• Geometric insights. Such improvements are possible because of a novel approach to an- 
alyzing the sparse subspace clustering problem. This analysis combine tools from convex 
optimization, probability theory and geometric functional analysis. Underlying our methods 
are clear geometric insights explaining quite precisely when SSC is successful and when it is 
not. This viewpoint might prove fruitful to address other sparse recovery problems. 

Section 3 proposes a careful comparison with the existing literature. Before doing so, we first need 
to introduce our results, which is the object of Sections 1.4 and 2. 

1.4 Models and typical results 
1.4.1 Models 

In order to better understand the regime in which SSC succeeds as well as its limitations, we will 
consider three different models. Our aim is to give informative bounds for these models highlighting 
the dependence upon key parameters of the problem such as 1) the number of subspaces, 2) the 
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dimensions of these subspaces, 3) the relative orientations of these subspaces, 4) the number of 
data points per subspace, and so on. 



• Deterministic model. In this model, the orientation of the subspaces as well as the dis- 
tribution of the points on each subspace are nonrandom. This is the setting considered by 
Elhamifar et. al. and is the subject of Theorem 2.5, which guarantees that the subspace de- 
tection property holds as long as for any two subspaces, pairs of (primal and dual) directions 
taken on each subspace have a sufficiently small inner product. 

• Semi-random model. Here, the subspaces are fixed but the points are distributed at 
random on each of the subspaces. This is the subject of Theorem 2.8, which uses a notion of 
affinity to measure closeness between any two subspaces. This affinity is maximal and equal 
to the square root of the dimension of the subspaces when they overlap perfectly. Here, our 
results state that if the affinity is smaller, by a logarithmic factor, than its maximum possible 
value, then SSC recovers the subspaces exactly. 

• Fully random model. Here, both the orientation of the subspaces and the distribution of 
the points are random. This is the subject of Theorem 1.2; in a nutshell, SSC succeds as long 
as the dimensions of the subspaces are within at most a logarithmic factor from the ambient 
dimension. 



1.4.2 Segmentation vi^ithout outliers 

Consider the fully random model first. We establish that the subspace detection property holds as 
long as the dimensions of the subspaces are roughly linear in the ambient dimension. Put differently, 
SSC can provably achieve perfect subspace recovery in settings not previously understood. 

Our results make use of a constant c{p) only depending upon the density of inliers (the number 
of points on each subspace is pd+1), and which obeys the following two properties: 

(i) For ah p> 1, c(p) > 0. 

(ii) There is a numerical value po, such that for all p> po, one can take c{p) - 

Theorem 1.2 Assume there are L subspaces, each of dimension d, chosen independently and uni- 
formly at random. Furthermore, suppose there are pd+1 points chosen independently and uniformly 
at random on each subspace} Then the subspace detection property holds with large probability as 
long as 

c^{p)\ogp 

d< ; n (1.3) 

121ogA^ ^ ^ 

(N - L{pd + 1) is the total number of data points). The probability is at least 1 - - Ne^'^'^, 
which is calculated for values of d close to the upper bound. For lower values of d, the probability 
of success is of course much higher, as explained below. 



^From here on, when we say that points are chosen from a subspace, we imphcitly assume they are unit normed. For 

d 

ease of presentation we state our results for 1 < p < e 2 , i.e. the number of points on each subspace is not exponentially 
large in terms of the dimension of that subspace. The results hold for all p > 1 by replacing p with min{p, e 2 }. 



6 



Theorem 1.2 is in fact a special instance of a more general theorem that we shall discuss later, 
and which holds under less restrictive assumptions on the orientations of the subspaces as well 
as the number and positions of the data points on each subspace. This theorem conforms to 
our intuition since clustering becomes more difficult as the dimensions of the subspaces increase. 
Intuitively, another difficult regime concerns a situation in which we have very many subspaces of 
small dimensions. This difficulty is reflected in the dependence of the denominator in (1.3) on L, 
the number of subspaces (through A^) . A more comprehensive explanation of this effect is provided 
in Section 2.1.2. 

As it becomes clear in the proof (see Section 7), a slightly more general version of Theorem 1.2 
holds, namely, with < /3 < 1, the subspace detection property holds as long as 



d< 2/3 



c^(p) logp 
121ogiV 



n (1.4) 



with probability at least 1 - - Ne^^^ Therefore, if d is a small fraction of the right-hand side 
in (1.3), the subspace detection property holds with much higher probability, as expected. 

An interesting regime is when the number of subspaces L is fixed and the density of points per 
subspace \s p = (F, for a small 77 > 0. Then as n 00 with the ratio d/n fixed, it follows from 
N X Lpd and (1.4) using /3 = 1 that the subspace detection property holds as long as 

d < — — ^ n. 

48(1 + r?) 

This justifies our earlier claims since we can have subspace dimensions growing linearly in the 
ambient dimension. It should be noted that this asymptotic statement is only a factor 8-10 
away from what is observed in simulations, which demonstrates a relatively small gap between our 
theoretical predictions and simulations.^ 



1.4.3 Segmentation with outliers 

We now turn our attention to the case where there our extraneous points in the data in the sense 
that there are Nq outliers assumed to be distributed uniformly at random on the unit sphere. Here, 
we wish to correctly identify the outlier points and apply any of the subspace clustering algorithms 
to the remaining samples. We propose a very simple detection procedure for this task. As in SSC, 
decompose each Xi as a linear combination of all the other points by solving an £i-minimization 
problem. Then one expects the expansion of an outlier to be less sparse. This suggests the following 
detection rule: declare Xi to be an outlier if and only if the optimal value of (1.2) is above a fixed 
threshold. This makes sense because if Xi is an outlier, one expects the optimal value to be on the 
order of \/n (provided is at most polynomial in n) whereas this value will be at most on the 
order of \/d if Xi belongs to a subspace of dimension d. In short, we expect a gap — a fact we will 
make rigorous in the next section. The main steps of the procedure are shown in Algorithm 2. 

Our second result asserts that as long as the number of outliers is not overwhelming. Algorithm 
2 detects all of them. 



^To be concrete, when the ambient dimension is = 50 and the number of subspaces is L = 10, the subspace detection 
property holds for d in the range from 7 to 10. 

^Here, 7 = ^^^^ is the total point density and A is a threshold ratio function whose value shall be discussed later. 
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Algorithm 2 Subspace clustering in the presence of outliers 



Input: A data set X arranged as columns of X € R"-^^ . 

1. Solve 

minimize \\Z\\ 
subject to XZ = X 

diag(Z) = 0. 

2. For each z e {1, . . . , N}, declare i to be an outlier iff \\zi\\^ > X{^)y/n.^ 

3. Apply a subspace clustering to the remaining points. 
Output: Partition . . . , X^. 



Theorem 1.3 Assume there are points to be clustered together with Nq outliers sampled uni- 
formly at random on the n - 1-dimensional unit sphere (N - Nq + N^). Algorithm 2 detects all of 
the outliers with high probability'^ as long as 

No<-e'^-Nd, 
n 

where c is a numerical constant. Furthermore, suppose the subspaces are d-dimensional and of 
arbitrary orientation, and that each contains pd + 1 points sampled independently and uniformly 
at random. Then with high probability,^ Algorithm 2 does not detect any subspace point as outlier 
provided that 

No<np'^'i-Nd 

in which C2 = c^(p)/(2e^7r). 

This result shows that our outlier detection scheme can reliably detect all outliers even when 
their number grows exponentially in the root of the ambient dimension. We emphasize that this 
holds without making any assumption whatsoever about the orientation of the subspaces or the 
distribution of the points on each subspace. Furthermore, if the points on each subspace are 
uniformly distributed, our scheme will not wrongfully detect a subspace point as an outlier. In the 
next section, we show that similar results hold under less restrictive assumptions. 

2 Main Results 

2.1 Segmentation without outliers 

In this section, we shall give sufficient conditions in the fully deterministic and semi-random model 
under which the SSC algorithm succeeds (we studied the fully random model in Theorem 1.2). 

Before we explain our results, we introduce some basic notation. We will arrange the A^ points 
on subspace S'^ as columns of a matrix X^^\ For I - 1,...,L, i - l,...,Ni, we use X^^) to 
denote all points on subspace Si excluding the ith point, X^^^ - [x^^\ . . . ,x^^^,x^^^, . . . ,a^^^^]. We 
use U^^^ e ^nxdi denote an arbitrary orthonormal basis for Si. This induces a factorization 

''With probability at least 1 - 7Voe"^"^'°*5(^o+^<*\ If iVo < ^e"^ - Na, this is at least 1 - i. 

^With probability at least 1 - Noe-'^"'^°<^'^^o+^-''> - Nae'-^''. If iVo < min{ne'=2 5 , ^e"^} - Na, this is at least 1 - ^ - 
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= ?7(^)a(^), where A^^^ = [a[^\- ■ • ^ R'^'^'^^^ is a matrix of coordinates with unit-norm 

columns. For any matrix X e R"'*^, the shorthand notation V{X) denotes the symmetrized convex 
huh of its columns, V{X) - conv(±a;i, ±0^2, • • • , ±xj\[). Also Vi^ stands for V{X^^^). Finally, ||X|| 
is the operator norm of X and \\X\\p the maximum absolute value of its entries. 



2.1.1 Deterministic model 

We first introduce some basic concepts needed to state our deterministic result. 

Definition 2.1 (dual point) Consider a vector y eW^ and a matrix A e R'^'*^, and let C* be the 
set of optimal solutions to 

max (y,A) subject to ||A^A|| < 1. 

The dual point X{y,A) e R"^ is defined as a point in C* with minimum Euclidean norm.^ A 
geometric representation is shown in Figure 2. 



-as. 




+a3 



+a2 



-ai 



Figure 2: Geometric representation of a dual point, see Definition 2.1. 

Definition 2.2 (dual directions) Define the dual directions v^^^ e R" (arranged as columns of a 
matrix V^^^ ) corresponding to the dual points X^^^ - X{a^^\A^^) 



as 



A 



The dual direction i> , corresponding to the point a; , from subspace Se is shown in Figure 3. 



"If this point is not unique, take \{y,A) to be any optimal point with minimum Euchdean norm. 



Figure 3: Geometric representation of a dual direction. The dual direction is the dual point 
embedded in the ambient n-dimensional space. 



Definition 2.3 (inradius) The inradius of a convex body V, denoted by r(V), is defined as the 
radius of the largest Euclidean ball inscribed in V. 

Definition 2.4 (subspace incoherence) The subspace incoherence of a point set Xi vis a vis the 

other points is defined by 



fi(Xe) = max 



where V^^^ is as in Definition 2.2. 
Theorem 2.5 // 

^l{X,)<mm r{VU) (2.1) 

for each i = 1, . . . ,L, then the subspace detection property holds. If (2.1) holds for a given i, then a 
local subspace detection property holds in the sense that for all Xi, the solution to (1.2) has nonzero 
entries only when the corresponding columns of X are in the same subspace as xi. 

The incoherence parameter of a set of points on one subspace with respect to other points is a 
measure of afhnity between subspaces. To see why, notice that if the incoherence is high, it implies 
that there is a point on one subspace and a direction on another (a dual direction) such that the 
angle between them is small. That is, there are two 'close' subspaces, hence, clustering becomes 
hard. The inradius measures the spread of points. A very small minimum inradius implies that 
the distribution of points is skewed towards certain directions; thus, subspace clustering using an 
penalty is difficult. To see why this is so, assume the subspace is of dimension 2 and all of the points 
on the subspace are skewed towards one line, except for one special point which is in the direction 
orthogonal to that line. This is shown in Figure 4 with the special point in red and the others in 
blue. To synthesize this special point as a linear combination of the others points from its subspace, 
we would need huge coefficient values and this is why it may very well be more economical — in an 
£i sense — to select points from other subspaces. This is a situation where minimization would 
still be successful but its convex surrogate is not (researchers familiar with sparse regression would 
recognize a setting in which variables are correlated, and which is challenging for the LASSO.) 
Theorem 2.5 essentially states that as long as different subspaces are not similarly oriented and 
the points on a single subspace are well spread, SSC can cluster the data correctly. A geometric 
perspective of (2.1) is provided in Section 4. 
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Figure 4: Skewed distribution of points on a single subspace and £i synthesis. 



To get concrete results, one needs to estimate both the incoherence and inradius in terms of the 
parameters of interest, which include the number of subspaces, the dimensions of the subspaces, 
the number of points on each subspace, and so on. To do this, we use the probabilistic models we 
introduced earlier. This is our next topic. 



2.1.2 Semi-random model 

The following definitions capture notions of similarity/affinity between two subspaces. 

Definition 2.6 The principal angles 0^}^, . . . ,6^^^^'^''^ between two subspaces Sk and Se of dimen- 
sions dk and di, are recursively defined by 

cos(C/^^ ) = max max ■- 



\\l2 



with the orthogonality constraints y yj - 0, z Zj - 0, j - 1, . . . ,i - 1. 

Alternatively, if the columns of U^''^ and U^^^ are orthobases, then the cosine of the principal 
angles are the singular values of U^^'^ U^^^ . We write the smallest principal angle as 9ke - 9^}} so 



rp 

that cos{9ki) is the largest sing ular value of jy^'^) U^'^\ 
Definition 2.7 The affinity between two subspaces is defined by 

aff(5fc, 5,) = v'cos2 0iJ^ + ... + cos2 02'^''^- 

In case the distribution of the points are uniform on their corresponding subspaces, the Geometric 
Condition (2.1) may be reduced to a simple statement about the affinity. This is the subject of the 
next theorem. 

Theorem 2.8 Suppose Ni - pidi + 1 points are chosen on each subspace Si at random, 1 < £ < L. 
Then as long as 

max 4V2l\og[NiiNk + 1)] + logL + A^i^hM < cipi)^]^e, for each I, (2.2) 

k:ki=E \ J y^dk 
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the subspace detection property holds with probability at least 



Hence, ignoring log factors, subspace clustering is possible if the affinity between the subspaces is 
less than about the square root of the dimension of these subspaces. 

To derive useful results, assume for simplicity that we have L subspaces of the same dimension d 
and pd+1 points per subspace so that N - L{pd+\). Then perfect clustering occurs with probability 
at least l-Are-v^'^-(^5^^)^e'2* if 

aff(Sfc,5^) ^ c(p)Vlogp 
Vd 4N/2(21ogiV + t)' 

Our notion of affinity matches our basic intuition. To be sure, if the subspaces are too close to 
each other (in terms of our defined notion of affinity), subspace clustering is hard. Having said 
this, our result has an element of surprise. Indeed, the affinity can at most be \/d (\/dk in general) 
and, therefore, our result essentially states that if the affinity is less than c\/d, then SSC works. 
Now this allows for subspaces to intersect and, yet, SSC still provably clusters all the data points 
correctly! 

To discuss other aspects of this result, assume as before that all subspaces have the same 
dimension d. When d is small and the total number of subspaces is 0{n/d), the problem is 
inherently hard because it involves clustering all the points into many small subgroups. This is 
reflected by the low probability of success in Theorem 2.8. Of course if one increases the number of 
points chosen from each subspace, the problem should intuitively become easier. The probability 
associated with (2.3) allows for such a trend. In other words, when d is small, one can increase 
the probability of success by increasing p. Introducing a parameter < /3 < 1, the condition can be 
modified to 

aff(6'fc,'S'^) ^ c{p)^P \ogp 

4(21ogiV + t)' ^ • ' 

which holds with probability at least 1 - Ne^''^ '^^d _ e~^*. The more general condition 

(2.2) and the corresponding probability can also be modified in a similar manner. 

2.2 Segmentation with outliers 

To see how Algorithm 2 works in the presence of outliers, we begin by introducing a proper threshold 
function, and define 

1 



A(7) = 



' ' 7>e, 



(2.5) 



log 7 

shown in Figure 5. The theorem below justifies the claims made in the introduction. 



Theorem 2.9 Suppose the outlier points are chosen uniformly at random and set 7 = "^^^j then 
using the threshold value (1 -t)^^\/n, all outliers are identified correctly with probability at least 



1 - iVoe ' for some positive numerical constant Ci. Furthermore, we have the following 

guarantees in the deterministic and semi-random models. 
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Figure 5: Plot of the threshold function (2.5). 



(a) If in the deterministic model, 



max L^<(i-t)^y^, (2.6) 



then no 'real' data point is wrongfully detected as an outlier, 
(h) If in the semi-random model, 

max f^ <{l-t)^-J^, (2.7) 

then with probability at least 1-Y,e=i N^e^'^^^^'^^'^ , no 'real' data point is wrongfully detected 
as an outlier. 



The threshold in the right-hand side of (2.6) and (2.7) is essentially y/n multiplied by a factor which 
depends only on the ratio of the number of points and the dimension of the ambient space. 

As in the situation with no outliers, when dg is small we need to increase A'^ to get a result 
holding with high probability. Again this is expected because when di is small, we need to be 
able to separate the outliers from many small clusters which is inherently a hard problem for small 
values of Ng. 

The careful reader will notice a factor y/e discrepancy between the threshold X{'y)\/n presented 
in Algorithm 2 and what is proven in (2.6) and (2.7). We believe that this is a result of our 
analysis'' and we conjecture that (2.6) and (2.7) hold without the factor a/b in the denominator. 
Our simulations in Section 5 support this conjecture. 



3 Discussion and comparison with other work 

It is time to compare our results with a couple of previous important theoretical advances. To 
introduce these earlier works, we first need some definitions. 

Definition 3.1 The subspaces {Si}^^-^ are said to be independent if and only if X!^dim(5'^) = 
dim(®^S'£), where ® is the direct sum. 

For instance, three lines in R'^ cannot be independent. 

^More specifically, from switching from tlie mean widtli to a volumetric argument by means of Urysolm's inequality. 
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Definition 3.2 The subspaces {5^}^^-^ are said to be disjoint if and only if for all pairs k t I, 
5fcnS^ = {0}. 

Definition 3.3 The geodesic distance between two subspaces Si and Sj of dimension d, denoted by 
dist ( 5i, S'j), is defined by 



(iist{Sk-,S() 



3.1 Segmentation without outliers 



In [11], Elhamifar and Vidal show that the subspace detection property holds as long as the sub- 
spaces are independent. In [12], the same authors show that under less restrictive conditions the 
£i subspace detection property still holds. Formally, they show that if 



—= max (Tmin("K) > max cos(6'fP 



max (Tmin("K) > max cos(0L ) for all £ = 1, . . . , L, (3.1) 



then the subspace detection property holds. In the above formulation, cTmin(^) denotes the smallest 
singular value of Y and W(^(X*^^^ ) denotes the set of all full rank sub-matrices of X^^^ of size n>^di. 
The interesting part of the above condition is the appearance of the principal angle on the right- 
hand side. However, the left-hand side is not particularly insightful (i.e. it does not tell us anything 
about the important parameters involved in the subspace clustering problem, such as dimensions, 
number of subspaces, and so on.) and it is in fact NP-hard to even calculate it. 

• Deterministic model. This paper also introduces a sufficient condition (2.1) under which 
the subspace detection property holds in the fully deterministic setting, compare Theorem 2.5. 
This sufficient condition is much less restrictive as any configuration obeying (3.1) also obeys 
(2.1). More, precisely /^(A^) < maxcos(0^^^) and -j= max ^ ^ o"min(^) ^ minjr(Pf^).^ 



As for (3.1), checking that (2.1) holds is also NP-hard in general. However, to prove that the 
subspace detection property holds, it is sufficient to check a slightly less restrictive condition 
than (2.1); this is tractable, see Lemma 7.1. 

Semi-random model. Assume that all subspaces are of the same dimension d and that 
there are pd + \ points on each subspace. Since the columns of Y have unit norm, it is 
easy to see that the left-hand side of (3.1) is strictly less than l/\/d. Thus, (3.1) at best 
restricts the range for perfect subspace recovery to cos0^^^ ^ looking at (3.1), it is 

not entirely clear that this would even be achievable). In comparison. Theorem 2.8 (excluding 
some logarithmic factors for ease of presentation) requires 

aff(5fe, Se) = Vcos2(^iJ^) + cos^e^^^) + ... + cos2(9(f ) < c^log{p)Vd. (3.2) 

The left-hand side of can be much smaller than \/dcos9^^ and is, therefore, less restrictive. 

To be more specific, assume that in the model described above we have two subspaces with 
an intersection of dimension s. Because the two subspaces intersect, the condition given by 



*The latter follows from maxi , < min — ^^1,. which is a simple consequence of Lemma 7.i 



14 



Elhamifar and Vidal becomes 1 < which cannot hold. In comparison, our condition (3.2) 
simphfies to 

cos\eif^^) + . . . + cos' (eg)) < clogip)d-s, 

which holds as long as s is not too large and/or a fraction of the angles are not too small. 
From an application standpoint, this is important because it explains why SSC can often 
succeed even when the subspaces are not disjoint. 

Fully random model. As before, assume for simplicity that all subspaces are of the same 
dimension d and that there are pd + 1 points on each subspace. We have seen that (3.1) 



imposes cos0^^ < It can be shown that in the fully random setting, cos.^^^ ~ "^y n 

Therefore, (3.1) would put a restriction of the form 



d < c\jn. 

In comparison. Theorem 1.2 requires 

d< ci- -n, 

log A* 

which allows for the dimension of the subspaces to be almost linear in the ambient dimension. 

Such improvements come from a geometric insight: it becomes apparent that the SSC algorithm 
succeeds if the actual subspace points (primal directions) have small inner products with the dual 
directions on another subspace. This is in contrast with Elhamifar and Vidal's condition which 
requires that the inner products between any direction on one subspace and any direction on another 
be small. Further geometric explanations are given in Section 4.2. 



3.2 Segmentation with outliers 

To the best of our knowledge, there is only one other theoretical result regarding outlier detection. 
In [26], Lerman and Zhang study the effectiveness of recovering subspaces in the presence of outliers 
by some sort of minimization for different values of < p < oo . They address simultaneous recovery 
of all L subspaces by minimizing the functional 

(A',5i, . . . ,S'l) = Y min(dist(a;,5£))''. (3.3) 

Here, Si,. . . ,Sl are the optimization variables and X is our data set. This is not a convex opti- 
mization for any p > 0, since the feasible set is the Grassmannian. 

In the semi-random model, the result of Lerman and Zhang states that under the assumptions 
stated in Theorem 1.3, with < p < 1 and tq a constant, ^'^ the subspaces Si, . . . ,Sl minimize (with 
large probability) the energy (3.3) among all d-dimensional subspaces in R" if 

No < Topdmin (l, min dist(5fc, Sey/2'P). (3.4) 

^One can see this by noticing that the square of this parameter is the largest root of a multivariate beta distribution. 
The asymptotic value of this root can be calculated e.g. see [21]. 

^°The result of [26] is a bit more general in that the points on each subspace can be sampled from a single distribution 
obeying certain regularity conditions, other than the uniform measure. In this case, to depends on this distribution 
as well. 
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It is easy to see that the right-hand side of (3.4) is upperbounded by pd, i.e. the typical number 
of points on each subspace. Notice that our analogous result in Theorem 1.2 allows for a much 
larger number of outliers. In fact, the number of outliers can sometimes even be much larger than 
the total number of data points on all subspaces combined. Our proposed algorithm also has the 
added benefit that it is convex and, therefore, practical. Having said this, it is worth mentioning 
that the results in [26] hold for a more general outlier model. Also, an interesting byproduct of 
the result from Lerman and Zhang is that the energy minimization can perform perfect subspace 
recovery when no outliers are present. In fact, they even extend this to the case when the subspace 
points are noisy. 

Finally, while this manuscript was in preparation, Liu Guangcan brought to our attention a 
new paper [28], which also addresses outlier detection. However, the suggested scheme limits the 
number of outliers to Nq < n-Y,e=i de- That is, when the total dimension of the subspaces {T,e=i dg) 
exceeds the ambient dimension n, outlier detection is not possible based on the suggested scheme. 
In contrast, our results guarantee perfect outlier detection even when the number of outliers far 
exceeds the number of data points. 

4 Geometric Perspective on the Separation Condition 

The goal of this section is twofold. One aim is to provide a geometric understanding of the subspace 
detection property and of the sufficient condition presented in Section 2.1. Another is to introduce 
concepts such as /C-norms and polar sets, which will play a crucial role in our analysis. 

4.1 Linear programming theory 

We are interested in finding the support of the optimal solution to 

min ||a;|L subject to Ax = y, (4.1) 
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where both y and the columns of A have unit norm. The dual takes the form 

max subject to < 1. (4.2) 

Since strong duality always holds in linear programming, the optimal values of (4.1) and (4.2) are 
equal. We now introduce some notation to express the dual program differently. 

Definition 4.1 The norm of a vector y with respect to a symmetric convex body is defined as 

||y|k = mf{t>0 : y/te/C}. (4.3) 
This norm is shown in Figure 6(a). 
Definition 4.2 The polar set 1C° o//C c R" is defined as 

)C°^{y€ R" : {x, y) < 1 for all x e JC}. (4.4) 
Set /C° - [z ■ WA^zll < l} so that our dual problem (4.2) is of the form 

^ II II -Coo 

max {y,z) subject to zelC°. (4-5) 

It then follows from the definitions above that the optimal value of (4.1) is given by ||i/||a:, 
where /C = conv( ± ai, . . . , ia^v); that is to say, the minimum value of the £i norm is the norm of y 
with respect to the symmetrized convex hull of the columns of A. In other words, this perspective 
asserts that support detection in an ii minimization problem is equivalent to finding the face of 
the polytope /C that passes through the ray y - {ty,t > 0}; the extreme points of this face reveal 
those indices with a nonzero entry. We will refer to the face passing through the ray y as the face 
closest to y. Figure 6(b) illustrates some of these concepts. 



4.2 A geometric view of the subspace detection property 

We have seen that the subspace detection property holds if for each point Xi, the closest face to 

Xi resides in the same subspace. To establish a geometric characterization, consider an arbitrary 
(i) 

point, for instance x^ € Si as in Figure 7. Now construct the symmetrized convex hull of all the 

other points in 5^ indicated by Vi^ in the figure. Consider the face of Vi^ that is closest to a; -^^ ; this 
face is shown in Figure 7 by the line segment in red. Also, consider the plane passing through this 
segment and orthogonal to Si along with its reflection about the origin; this is shown in Figure 7 
by the light grey planes. Set to be the region of space restricted between these two planes. 

Intuitively, if no two points on the other subspaces lie outside of , then the face chosen by the 
algorithm is as in the figure, and lies in S£. 

To illustrate this point further, suppose there are two points not in Si lying outside of the region 
i?.^"* as in Figure 8. In this case, the closest face does not lie in as can be seen in the figure. 
Therefore, one could intuitively argue that a sufficient condition for the closest face to lie in Si is 
that the projections onto S^ of the points from all the other subspaces do not lie outside of regions 
i?. for all points x^ in subspace Sg. This condition is closely related to the sufficient condition 
stated in Theorem 2.5. More, precisely the dual directions v^^^ approximate the normal directions 
to the restricting planes R^^\ and minjr(Pf •) the distance of these planes from the origin. 
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Figure 7: Illustration of £i minimization when the subspace detection property holds. Same 
object seen from different angles. 




Figure 8: Illustration of £i minimization when the subspace detection property fails. Same 
object seen from different angles. 
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Figure 9: Geometric view of (2.1). The right figure is seen from a direction orthogonal to 5i. 



Finally, to understand the sufficient condition of Theorem 2.5, we will use Figure 9. We focus 
on a single subspace, say Si. As previously stated, a sufficient condition is to have all points not 
in Si to have small coherence with the dual directions of the points in Si . The dual directions are 
depicted in Figure 9 (blue dots) . One such dual direction line is shown as the dashed blue line in the 
figure. The points that have low coherence with the dual directions are the points whose projection 
onto subspace 5i lie inside the red polytope. As can be seen, this polytope approximates the 
intersection of regions R^^^ (n^\-Rj-^^) and subspace Si. This helps understanding the difference 
between the condition imposed by Elhamifar and Vidal and our condition; in this setting, their 
condition essentially states that the projection of the points on all other subspaces onto subspace 
Si must lie inside the blue circle. By looking at Figure 9, one might draw the conclusion that these 
conditions are very similar, i.e. the red polytope and the blue ball restrict almost the same region. 
This is not the case, because as the dimension of the subspace Si increases most of the volume of 
the red polytope will be concentrated around its vertices and the ball will only occupy a very small 
fraction of the total volume of the polytope. 



5 Numerical Results 

This section proposes numerical experiments on synthesized data to further our understanding of 
the behavior/limitations of SSC, of our analysis, and of our proposed outlier detection scheme. 
In this numerical study we restrict ourselves to understanding the effect of noise on the spectral 
gap and the estimation of the number of subspaces. For a more comprehensive analytical and 
numerical study of SSC in the presence of noise, we refer the reader to [8]. For comparison of SSC 
with more recent methods on motion segmentation data, we refer the reader to [27, 13]. These 
papers indicate that SSC has the best performance on the Hopkins 155 data [39] when corrupted 
trajectories are present, and has a performance competitive with the state of the art when there is 
no corrupted trajectory. In the spirit of reproducible research, the Matlab code generating all the 
plots is available at [52]. 
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5.1 Segmentation without outliers 

As mentioned in the introduction, the subspace detection property can hold even when the dimen- 
sions of the subspaces are large in comparison with the ambient dimension n. SSC can also work 
beyond the region where the subspace detection property holds because of further spectral cluster- 
ing. Section 5.1.1 introduces several metrics to assess performance and Section 5.1.2 demonstrates 
that the subspace detection property can hold even when the subspaces intersect. In Section 5.1.3, 
we study the performance of SSC under changes in the affinity between subspaces and the number 
of points per subspace. In Section 5.1.4, we illustrate the effect of the dimension of the subspaces 
on the subspace detection property and the spectral gap. In Section 5.1.5, we study the effect of 
noise on the spectral gap. In the final subsection, we study the capability of SSC in estimating the 
correct number of subspaces, and compare it with a classical algorithm. 

5.1.1 Error metrics 

The four different metrics we use are (see [12] for simulations using similar metrics): 
• Feature detection error. For each point Xi, partition the optimal solution of SSC as 



Zil 
Zi2 

ZiL 



In this representation, T is our unknown permutation matrix and Zii,Zi2, . . . , zn denote the 
coefficients corresponding to each of the L subspaces. Using as the total number of points, 
the feature detection error is 



ly(iJ\ph] (5.1) 



N 

in which ki is the subpace Xi belongs to. The quantity between brackets in (5.1) measures 
how far we are from choosing all our neighbors in the same subspace; when the subspace 
detection property holds, this term is equal to whereas it takes on the value 1 when all the 
points are chosen from the other subspaces. 

Clustering error. Here, we assume knowledge of the number of subspaces and apply spectral 
clustering to the affinity matrix built by the SSC algorithm. After the spectral clustering 
step, the clustering error is simply defined as 

# of misclassified points 



total # of points 



(5.2) 



Error in estimating the number of subspaces. This is a 0-1 error which takes on the value 
if the true number of subspaces is correctly estimated, and 1 otherwise. 

Smallest nonzero eigenvalue. We use the (A - L) + 1-th smallest eigenvalue of the normalized 
Laplacian^^ as a numerical check on whether the subspace detection property holds (when 
the subspace detection property holds this value vanishes). 



After building the symmetrized afiinity graph W = \Z\ + {Zf^ , we form the normahzed Laplacian Ln = 
I - D-^I^WD-^I^, where D is a diagonal matrix and Da is equal to the sum of the elements in column Wi. 
This form of the Laplacian works better for spectral clustering as observed in many applications [.32] . 
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(a) 



(b) 



(c) 



Figure 10: Error metrics as a function of the dimension of the intersection, (a) Feature 
detection error, (b) Clustering error, (c) Error in estimating the number of subspaces. 

5.1.2 Subspace detection property holds even when the subspaces intersect 

We wish to demonstrate that the subspace detection property holds even when the subspaces 
intersect. To this end, we generate two subspaces of dimension d = 10 in |R"-=200 -^[^y^ g^j^ intersection 
of dimension s. We sample one subspace {Si) of dimension d uniformly at random among all 
d-dimensional subspaces and a subspace of dimension s (denoted by 5*2^^ ) inside that subspace, 

(2) 

again, uniformly at random. Sample another subspace 5*2 of dimension d-s uniformly at random 



Our experiment selects Ni - N2 - 20d points uniformly at random from each subspace. We 
generate 20 instances from this model and report the average of the first three error criteria over 
these instances, see Figure 10. Here, the subspace detection property holds up to s = 3. Also, after 
the spectral clustering step, SSC has a vanishing clustering error even when the dimension of the 
intersection is as large as s - 6. 

5.1.3 Effect of the affinity between subspaces 

In Section 2.1.2, we showed that in the semi-random model, the success of SSC depends upon 
the affinity between the subspaces and upon the density of points per subspace (recovery becomes 
harder as the affinity increases and as the density of points per subspace decreases). We study here 
this trade-off in greater details through experiments on synthetic data. 

We generate 3 subspaces Si, 5*2, and 5*3, each of dimension d = 20 in [R""^°. The choice n - 2d 
makes the problem challenging since every data point on one subspace can also be expressed as a 
linear combination of points on other subspaces. The bases we choose for Si and S2 are 



and set S2 = S^^^ e S^^\ 




Id 



(5.3) 



21 



whereas for 53, 



cos(^i) 





sin(^i) 









cos(6l2) 

cos(6'3) 





sin(6l2) 

sm(6l3) 



















cos(6'd) 




sin((9rf). 



(5.4) 



Above, the principal angles are set in such a way that cos^j decreases linearly from cos 9 to acosO, 
where 9 and a are fixed parameters; that is to say, cos^j = (l-a(i-l)) cos 9, a - -^^f- 

In our experiments we sample pd points uniformly at random from each subspace. We fix 
a - ^ and vary p e [2, 10] and 9 e [0, ^]. Since a - ^, as 9 increases from to 7r/2, the normalized 
maximum affinity maxj^j aff(5j, Sj)/\/d decreases from 1 to 0.7094 (recall that a normalized affinity 
equal to 1 indicates a perfect overlap, i.e. two subspaces are the same). For each value of p and 9, 
we evaluate the SSC performance according to the three error criteria above. The results, shown in 
Figure 11, indicate that SSC is successful even for large values of the maximum affinity as long as the 
density is sufficiently large. Also, the figures display a clear correlation between the three different 
error criteria indicating that each could be used as a proxy for the other two. An interesting point 
is p - 3.25 and aS/\/d - 0.9; here, the algorithm can identify the number of subspaces correctly and 
perform perfect subspace clustering (clustering error is 0). This indicates that the SSC algorithm 
in its full generality can achieve perfect subspace clustering even when the subspaces are very close. 



5.1.4 Effect of dimension on subspace detection property and spectral gap 

In order to illustrate the effect an increase in the dimension of subspaces has on the spectral gap, we 
generate L - 20 subspaces chosen uniformly at random from all d-dimensional subspaces in R^'^. We 
consider 5 different values for d, namely, 5, 10, 15, 20, 25. In all these cases, the total dimension 
of the subspaces Ld is more than the ambient dimension n = 50. We generate 4:d unit-normed 
points on each subspace uniformly at random. The corresponding singular values of the normalized 
Laplacian are displayed in Figure 12. As evident from this figure, the subspace detection property 
holds, when the dimension of the subspaces is less than 10 (this corresponds to the last eigenvalues 
being exactly equal to 0). Beyond d = 10, the gap is still evident, however, the gap decreases as d 
increases. In all these cases, the gap was detectable using the sharpest descent heuristic presented 
in Algorithm 1 and thus, the correct estimates for the number of subspaces were always found. 

5.1.5 Effect of noise on spectral gap 

In order to illustrate the effect of noise on the spectral gap, we sample L - 10 subspaces chosen 
uniformly at random from all d - 20-dimensional subspaces in R^^. The total dimension of the 
subspaces {Ld - 200) is once again more than the ambient dimension n - 50. We then sample 
points on each subspace — 4d per subspace as before — and perturb each unit-norm data point Xi 
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(c) Error in estimating the number of subspaces. 



Figure 11: Performance of the SSC algorithm for different values of the affinity and density 
of points per subspace. In all three figures, the horizontal axis is the density p, and the vertical 
axis is the normalized maximum affinity vnox^i^j 

by a noisy vector chosen independently and uniformly at random on the sphere of radius a (noise 
level) and then normalize to have unit norm. The noisy samples are Xi - || — , where H^iH^^ ~ ^• 
We consider 9 different values for the noise level, namely, 0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 
0.4. The corresponding singular values of the normalized Laplacian are shown in Figure 13. As 
evident from this figure, we are in a regime where the subspace detection property does not hold 
even for noiseless data (this corresponds to the last eigenvalues not being exactly equal to 0). For 
a positive, the gap is still evident but decreases as a function of a. In all these cases, the gap was 
detectable using the sharpest descent heuristic presented in Algorithm 1, and thus the number of 
subspaces was always correctly inferred. 

5.1.6 Comparison with other methods 

We now hope to demonstrate that one of the main advantages of SSC is its ability to identify, in 
much broader circumstances, the correct number of subspaces using the eigen-gap heuristic. Before 
we discuss the pertaining numerical results, we quickly review a classical method in subspace 
clustering [10]. Start with the rank-r SVD X - LfSiV^ of the data matrix, and use W - VV'^ as 
the affinity matrix. (Interestingly, the nuclear-norm heuristic also results in the same affinity matrix 
[27, 13]). It was shown in [10] that when the subspaces are independent, the affinity matrix will 
be block diagonal and one can thus perform perfect subspace clustering. When the subspaces are 
not independent, the affinity matrix may occasionally be approximately block diagonal as observed 
empirically in some particular computer vision applications. In the presence of noise, or when the 
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Figure 12: Gaps in the eigenvalues of the normalized Laplacian as a function of subspace 
dimension. 
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Figure 13: Gaps in the eigenvalues of the normalized Laplacian for different values of the 
noise level a. 

independence assumption is violated, various methods have been proposed to "clean up" the affinity 
matrix and put it into block diagonal form [10, 23, 20, 46, 24, 22]. As noted by Vidal in [45] most 
of these algorithms need some knovi^ledge of the true data rank and/or dimension of the subspaces. 
Furthermore, none of these algorithms have been proven to work when the independence criterion 
is violated — in contrast with the analysis presented in this paper. 

We believe that a major advantage of SSC vis a vis more recent approaches [27, 13] is that 
the eigen-gap heuristic is applicable under broader circumstances. To demonstrate this, we sample 
-L = 10 subspaces chosen uniformly at random from all 10-dimensional subspaces in IR^*^. The total 
dimension Ld - 100 is once more larger than the ambient dimension n = 50. The eigenvalues of the 
normalized Laplacian of the affinity matrix for both SSC and the classical method {W - W"^) 
are shown in Figure 14 (a). Observe that the gap exists in both plots. However, SSC demonstrates 
a wider gap and, therefore, the estimation of the number of subspaces is more robust to noise. To 
illustrate this point further, consider Figure 14 (b) in which points are sampled according to the 
same scheme but with d = 30, and with noise possibly added just as in Section 5.1.5. Both in the 
noisy and noiseless cases, the classical method does not produce a detectable gap while the gap is 
detectable using the simple methodology presented in Algorithm 1. 
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Figure 14: Gaps in the eigenvalues of the normahzed Laplacian for the affinity graphs, (a) 
Noiseless setup with d= 10 (the zoom is to see the gap for classical method more clearly), (b) 
Noiseless and noisy setups with d=30. 



5.2 Segmentation with outliers 

We now turn to outlier detection. For this purpose, we consider three different setups in which 



• 




5, 


n - 


50, 


• 


d = 
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n - 


100, 


• 


d = 


5, 


n - 


200. 



In each case, we sample L - 2n/d subspaces chosen uniformly at random so that the total dimension 
Ld - 2n. For each subspace, we generate 5d points uniformly at random so that the total number 
of data points is - Wn. We add Nq - outliers chosen uniformly at random on the sphere. 
Hence, the number of outliers is equal to the number of data points. The optimal values of the 
optimization problems (1.2) are plotted in Figure 15. The first values correspond to the data 
points and the next Nq values to the outliers. As can be seen in all the plots, a gap appears in the 
values of the ii norm of the optimal solutions. That is, the optimal value for data points is much 
smaller than the corresponding optimal value for outlier points. We have argued that the critical 
parameter for outlier detection is the ratio d/n. The smaller, the better. As can be seen in Figure 15 
(a). The ratio d/n = 1/10 is already small enough for the conjectured threshold of Algorithm 2 to 
work and detect all outlier points correctly. However, it wrongfully considers a few data points as 
outliers. In Figure 15 (b), d/n - 1/20 and the conjectured threshold already works perfectly but 
the proven threshold is still not able to do outlier detection well. In Figure 15 (c), d/n - 1/40, both 
the conjectured and proven thresholds can perform perfect outlier detection. (In practice it is of 
course not necessary to use the threshold as a criterion for outlier detection; one can instead use a 
gap in the optimal values.) It is also worth mentioning that if d is larger, the optimal value is more 
concentrated for the data points and, therefore, both the proven and conjectured threshold would 
work for smaller ratios of d/n (this is different from the small values of d above). 
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6 Background on Geometric Functional Analysis 

Our proofs rely heavily on techniques from Geometric Functional Analysis and we now introduce 
some basic concepts and results from this field. Most of our exposition is adapted from [41]. 

Definition 6.1 The maximal and average values of || • \)q on the sphere 5""^ are defined by 
b{lC) - sup \x\]<z and M{K,) - / \x\]<zda{x). 

Above, a is the uniform probability measure on the sphere. 

Definition 6.2 The mean width M*(/C) of a symmetric convex body fC in R" is the expected value 
of the dual norm over the unit sphere, 

M*(/C) = Af(/C°) = [ \\y\\K,oda{y)^ f max(y, z) 

With this in place, we now record some useful results. 
Lemma 6.3 We always have M(/C)M(/C°) > 1. 

Proof Observe that since || • \\fco is the dual norm of || • \\x\\'^ - \\x\\ic \\x\\i<^o and thus 

where the inequality follows from Cauchy-Schwarz. ■ 

The following theorem deals with concentration properties of norms. According to [25], these 
appear in the first pages of [31]. 

Theorem 6.4 (Concentration of measure) For each t>0, we have 

a{x 6 : - M{JC)\ > tM{}C)} < e^p(-ct\ [^^f), 

where c> is a universal constant. 

The following lemma is a simple modification of a well-known result in Geometric Functional 
Analysis. 

Lemma 6.5 (Many faces of convex symmetric polytopes) Let V be a symmetric polytope 
with f faces. Then 

»(^)%.M/), 

for some positive numerical constant c > 0. 

Definition 6.6 (Geometric Banach-Mazur Distance) Let /C and C be symmetric convex bod- 
ies in R". The Banach-Mazur distance between fC and C, denoted by d{lC,C), is the least positive 
value ab €R for which there is a linear image T(/C) of IC obeying 

b-^C c T(/C) c aC. 
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Theorem 6.7 (John's Theorem) Let K, he a symmetric convex body in IR" and B2 he the unit 
hall of R". Then d{IC, Bl^) < 

Our proofs make use of two theorems concerning volume ratios. The first is this. 
Lemma 6.8 (Urysohn's inequality) Let /C c R" be a compact set. Then 

vol(/C) 



vol(i?2") 



< M*{)C). 



Lemma 6.9 [3, Theorem 2] Let IC" ^ {z € R" : \{ai,z)\ < 1 : i = 1, . . . , iV} with \\ai\\^^ = 1. The 
volume of 1C° admits the lower estimate 



'2V2 



vol 



Here, n<N,l<p<oo and r ^ {^zZi\\at\\%)^ ■ 



P>2, 
i/1 <p<2. 



7 Proofs 

To avoid repetition, we define the primal optimization problem P{y, A) as 

min lla^ll^^ subject to Ax - y, 

and its dual D{y, A) as 

max {y,v) subject to < 1. 

We denote the optimal solutions by optsolP(y, A) and optsolD(y, A). Since the primal is a linear 
program, strong duality holds, and both the primal and dual have the same optimal value which we 
denote by optval(y, A) (the optimal value is set to infinity when the primal problem is infeasible). 
Also notice that as discussed in Section 4, this optimal value is equal to ||i/||a:) where JC(A) = 
conv(±ai, . . . , ittTv) and /C"(A) = {z : 11 A'^a;!! < 1}. 



7.1 Proof of Theorem 2.5 

We first prove that the geometric condition (2.1) implies the subspace detection property. We 
begin by establishing a simple variant of a now classical lemma (e.g. see [7]). Below, we use the 
notation As to denote the submatrix of A with the same rows as A and columns with indices in 
5c {!,..., AT}. 

Lemma 7.1 Consider a vector y e R" and a matrix A e R"'*^. // there exists c obeying y - Ac 
with support S '=T, and a dual certificate vector u satisfying 

A^v= sgn{cs), \\a'^^scv\\ < 1, ||a;^cI^|| < 1, 

II II II II -t-oo 

then all optimal solutions z* to P{y,A) obey z^c - 0. 
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Proof Observe that for any optimal solution z* of P(y, A), we have 



l%ll£i + Il^rn5=ll^i + II 



^ llcsll^, + (sgn(cs), 2:5-05) + II z-^n5-llfi + Nr-ll^i 
= llcsllfj + {v,As{Zs - C5)) + llz^ns-ll^i + Nt-II^i 



Now note that 



In a similar manner, we have {y ^ A^cz'^c) < \\A'^cI^\\^ W^T^Wej- Hence, using these two identities 
we get 



,^>||c||,^ + (l-||A?ci/||,^) 



I 



Since z* is an optimal solution, ||-z*||^^ < ||c||^^, and plugging this into the last identity gives 

(1 - ||Ai^ci/||^^) ||2;^c||^^ < 0. 
Now since IIA^cI^IL < 1, it follows that llz^dL = 0. 



Consider x^^^ - U^^^a^^\ where U^^^ e ^^^d,, -g orthogonal basis for S"^ and define 

cf). optsolP(af),AW). 
Letting S be the support of , define A^^^ as an optimal solution to 

Af^^ subject to {(Ai?)^Af ) - sgn(cf )), \{A%:>^^ 



A- ' - argmm 



(■2 



< 1 



Because is optimal for the primal problem, the dual problem is feasible by strong duality and 
the set above is nonempty. Also, A^^^ is a dual point in the sense of Definition 2.1; i.e. A^^^ = 
\{af^ , A^j^ ) . Introduce 



so that the direction of v} is the ith dual direction; i.e. ly} 



A 



(see Definition 2.2). 



Put T to index those columns of in the same subspace as x- (subspace Si). Using this 
definition, the subspace detection property holds if we can prove the existence of vectors c (obeying 
Cfc = 0) and u as in Lemma 7.1 for problems P(^x^^\ X^i^ of the form 



mm 



z L subject to X^iZ - x) 



(7.1) 
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We set to prove that the vectors c - |^0, . . . , 0, c^^\ 0, . . . , , which obeys ct<: - and is feasible 

for (7.1), and u^^'^ are indeed as in Lemma 7.1. To do this, we have to check that the following 
conditions are satisfied: 



(Xl?)^.f).sgn(cr) 



< 1, 



and for all x € X \ Xf 



< 1. 



(7.2) 
(7.3) 

(7.4) 



Conditions (7.2) and (7.3) are satisfied by definition, since 



and 



< 1. 



Therefore, in order to prove that the subspace detection property holds, it remains to check that 
for all X € X \ Xf we have 



i2 



< 1. 



By definition of A- 



< 1, and therefore, A-^"* e ("Pfj)", where 



T 



<1 . 



Definition 7.2 (circumradius) The circumradius of a convex body V , denoted by R{V), is de- 
fined as the radius of the smallest ball containing V . 

Using this definition and the fact that A^ e {j^-if have 



where the equality is a consequence of the lemma below. 

Lemma 7.3 [6, page 44^] ^^r a symmetric convex body V , i.e. V - -V , the following relationship 
between the inradius of V and circumradius of its polar V holds: 



r{V)R{V°) = 1. 



In summary, it suffices to verify that for all pairs {C-,i) (a pair corresponds to a point x\^'^ e X() 
and all X e \ Af^, we have 

{x,vf^)\<r{vU). 

Now notice that the latter is precisely the sufficient condition given in the statement of Theorem 
2.5, thereby concluding the proof. 
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7.2 Proof of Theorem 2.8 

We prove this in two steps. 

Step 1: We develop a lower bound about the inradii, namely, 



P 



< r{Vii) for all pairs > 1 - XI ^^e" 



(7.5) 



Step 2: Notice that p{Xe,) - maxk-.kte 
about the subspace incoherence, name 



. Therefore we develop an upper bound 



P 



<4{\og[Ni{Nk + l)] + \ogL + t)^^^^^^ for all pairs {e,k) with £ ^ A: ] 



dk\/de 



(7.6) 



Notice that if the condition (2.2) in Theorem 2.8 holds, i.e. 



max 4x/2^ log[iV,(iV, + i)] + bg L + t J ^^^^^ < c{pi)V^e, 



then Step 1 and Step 2 imply that the deterministic condition in Theorem 2.5 holds with high 
probability. In turn, this gives the subspace detection property. 



7.2.1 Proof of Step 1 

Here, we simply make use of a lemma stating that the inradius of a polytope with vertices chosen 
uniformly at random from the unit sphere is lower bounded with high probability. 

Lemma 7.4 ([2]) Assume {Pi}iii are independent random vectors on S'^"^, and set K, - conv(±Pi, 
. . . , ±Pn)- For every (5 > 0, there exists a constant C{5) such that if (1 + 5)d < N < de^ , then 



( log - 

P|r(/C) <min{C(5),l/\/8}A^ 

Furthermore, there exists a numerical constant 6o such that for all N > d{l + 5o) we have 



1 



P r(/C) < 



d 

One can increase the probability with which this lemma holds by introducing a parameter < /3 < 1 
in the lower bound ([15]). A modification of the arguments yields (note the smaller bound on the 
probability of failure) 



p|r(/C) <min{C(5),l/\/8 



log 



N 



d 



< e 
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This is where the definition of the constant c(p) comes in. We set c{p) = min{C(p- 1), 1/^8} and 
Po - ^0 + ^ where 5o is as in the above Lemma and use /? = |- Now since j consists of 2{N£ - 1) 
vertices on S*^*"^ taken from the intersection of the unit sphere with the subspace Se of dimension 
d£, applying Lemma 7.4 and using the union bound estabhshes (7.5). 

7.2.2 Proof of Step 2 

By definition 





= max 




= max 








2 







A 



(7.7) 



Now it follows from the uniform distribution of the points on each subspace that the columns of 
are independently and uniformly distributed on the unit sphere of R'^*. Furthermore, the 
normalized dual points^'^ A^^'/ A,.^^ ^ are also distributed uniformly at random on the unit sphere 



of R To justify this claim, assume U is an orthogonal transform on R^ and a[ {U) is the dual 

I in 

point corresponding to Uai and UA_/. Then 



(7. 



where we have used the fact that A^ is the dual variable in the corresponding optimization problem. 
On the other hand we know that 



(7.9) 



where X ~ y means that the random variables X and Y have the same distribution. This follows 
from Uai ~ ttj and UA^) ~ A^) since the columns of A^^^ are chosen uniformly at random on the 
unit sphere. Combining (7.8) and (7.9) implies that for any orthogonal transformation U, we have 

which proves the claim. 

Continuing with (7.7), since A^-^^ and A^''^ are independent, applying Lemma 7.5 below with 
A = NfL, Ni - Nf:, di - dk, and ^2 = di gives 









i2 



<4{log[Ne{Nk + l)] + log L + t) 



di.\i do 



with probability at least 1 



(Affc+l)Aff2L2 



Finally, applying the union bound twice gives (7.6). 



Recall that c(p) is defined as a constant obeying the following two properties: (i) for all p > 1, c(p) > 0; (ii) there 
is a numerical value po, such that for all p > po, one can take c(p) = 

Since the columns of AS''^ are independently and uniformly distributed on the unit sphere of R"^*^ , A^^^ in Definition 
2.1 is uniquely defined with probabilty 1. 
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Lemma 7.5 Let A e [R^i^^i he a matrix with columns sampled uniformly at random from the unit 
sphere of , X e R'^^ he a vector sampled uniformly at random from the unit sphere of R*^^ and 
independent of A and 5] e ^di^d2 ^ deterministic matrix. For any positive constant A, we have 



IA^SAL <4(log(iVi + l) + logA + t)- "^"^ 



'<h\/d2 



with probability at least 1 - jj^^^ij^e. 



Proof The proof is standard. Without loss of generahty, we assume di < d2 as the other case is 
similar. To begin with, the mapping A ||X1A||^^ is Lipschitz with constant at most ui (this is the 
largest singular value of S). Hence, Borell's inequality gives 



P{||5]A||,^-^E||I]A|||>e}< 



Because A is uniformly distributed on the unit sphere, we have E||SA||£^ = ||S||p/(i2. Plugging 



e - (6-1)^^ into the above inequality, where b = 2'\/log(A^i + 1) + log A + t, and using || 5] || /fii > 1 



give 



P(||5]A|L >6^)<- 

^" "^^ Vd^^ (iVi + l)2A2 



Further, letting a e R'^^ be a representative column of A, a well-known upper bound on the area 
of spherical caps gives 



P{|a^z| > e||z||^2 } < 2e^ 



in which z is a fixed vector. We use z = SA, and e = b/-\/di. Therefore, for any column a of A we 
have 



I , T , b „ „ 1 -dl-!^ 2 



-2t 



Now applying the union bound yields 



P IIA^SAL > —= IISAL, < rin^e-^K 



Plugging in the bound for ||SA||^^ concludes the proof. ■ 

7.3 Proof of Theorem 1.2 

We prove this in two steps. 

Step 1: We use the lower bound about the inradii used in Step 1 of the proof of Theorem 
2.8 with /3 = |, namely, 

pj^y^SZ < r{V^_.) for aU pairs > 1 - Ne-^"^. 
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Step 2: We develop an upper bound about subspace incoherence, namely, 



pjMTO < for alH|>l-|. 

To prove Step 2, notice that in the fully random model, the marginal distribution of a column x 
is uniform on the unit sphere. Furthermore, since the the points on each subspace are sampled 
uniformly at random, the argument in the proof of Theorem 2.8 asserts that the dual directions 
are sampled uniformly at random on each subspace. By what we have just seen, the points ' are 
then also distributed uniformly at random on the unit sphere (they are not independent). Lastly, 
the random vectors v)^ and x € X \ X£ are independent. The distribution of their inner product is 
as if one were fixed, and applying the well-known upper bound on the area of a spherical cap gives 



V 



> 



eiogiv 2 



n 



Step 2 follows by applying the union bound to at most A^^ such pairs. 
7.4 Proof of Theorem 2.9 

We begin with two lemmas relating the mean and maximal value of norms with respect to convex 
polytopes. 

Lemma 7.6 For a symmetric convex body in R", 

M(/C)Af(/C") J_ 
6(/C)6(/C°) " 

Proof Variants of this lemma are well known in geometric functional analysis. By definition, 

\\x\\ic < b{IC)\\x\\2, 
WxWic < 6(/C°)||x||2, 

and, hence, the property of dual norms allows us to conclude that 

1|2;||2 < \\x\\ic < 6(/C)||x||2, 



6(/C°) 

TT^Ikb < \\x\\fC° ^ b{^")\\x\\2- 

However, using Definition 6.6, these relationships imply that d(IC,B2 ) - 6(/C)6(/C°). Therefore, 

M(/C)M(/C°) _ M(/C)M(/C°) 
6(/C)6(/C°) " d{lC,B^) ' 

Applying John's lemma and using Lemma 6.3 concludes the proof. 
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Lemma 7.7 For a convex symmetric polytope 1C{A), A e R"'*^, we have 

iM(lC)\ n 



5(/C) > - log(2iV) 



Proof By Lemma 7.6, we know that 



M{K.)M{K.°) ^ J_ ^ M{K.) ^ 1 



fe(/C)6(/C°) b{JC) 
However, applying Lemma 6.5 to the polytope /C°, which has at most 2N faces, gives 

These two inequalities imply 

M(/C) 1 lM(lC)\ 1 n 



b{K:) - ^C\og{2N) ^ b{lC) I " Clog(2iV)' 



7.4.1 Proof of Theorem 2.9 (part (a)) 

The proof is in two steps. 



1- For every inlier point x\^\ 



optval(a.f\x_,)<^. (7.10) 



2- For every outlier point x^^\ with probability at least 1 - e '^log^ , we have 

(1 - < optval(a:[°\ 



Proof of step 1 

Lemma 7.8 Suppose y e Range(A) , then 

\\y\\ 



optval(y, a) 



< 



r(/C(A)) 
Proof As stated before, 

optval(y. A) = 

Put /C(A) - IC for short. Using the definition of the max norm and circumradius 



\\y\\ic = \\y\\i2 



y 



\\y\\ 



yje^ 



<||z/||,^6(/C) = ||y||,^i?(/C°) = ^. (7.11) 
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The last equality follows from the fact that maximal norm on the unit sphere and the inradius are 
the inverse of one another (Lemma 7.3). ■ 

Notice that 

optval(a;f ^ ,X-i)< optval(a;f ^ , X^f ) . 



and since 



1 , applying the above lemma with y - xf^ and A - X^^-^ , gives 



optval(a;f\xi?) < 



Combining these two identities establishes (7.10). 



Proof of step 2 We are interested in lower bounding optval(y, A) in which A is a fixed matrix 
and y e R" is chosen uniformly at random on the unit sphere. Our strategy consists in finding a 
lower bound in expectation, and then using a concentration argument to derive a bound that holds 
with high probability. 

Lemma 7.9 (Lower bound in expectation) Suppose y € R"" is a point chosen uniformly at 
random on the unit sphere and A e [R"'*^ is a matrix with unit-norm columns. Then 



E{optval(y,A)} 



1 f2 n 

^/eV T ^/]V 



if !<-<€, 



Proof Since optval(y, A) = I|2/|Ia:(a)' expected value is equal to M*{K,°) - M{)C). Applying 
Urysohn's Theorem (Theorem 6.8) gives 

It is well known that the volume of the n-dimensional sphere with radius one is given by 



VOl(i?2") 



vr 



n/2 



r(? + i) 



The well-known Stirling approximation gives 

- n 



and, therefore, the volume obeys 



(v¥)" 



Note that if {a,}^^ is a family of n-dimensional unit-norm vectors, then for p> 1, 



N 



(-EkD <(-) • 
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Applying Lemma 6.9 for p>2 gives 



vol(/C°)" > 



The right-hand side is maximum when p - 2 log ^, which is larger than 2 as long as ^ > e. When 



TV 
n 



< e, we shall use p - 2. Plugging in this value of p in the bound of Lemma 6.9, we conclude that 



vol(/C°)" > 



^ JV ' 

n 



if 



N 
n 

N 



if 1 < ^ < e, 



> e. 



Finally, this idenitity together with the approximation of the volume of the sphere conclude the 
proof. ■ 



Lemma 7.10 (Concentration around mean) In the setup of Lemma 7.9, 

optval(y,A) > (1 - t)E{optval(y, A)}, 

with probability at least 1 - e iog(2JV) . 

Proof The proof follows from Theorem 6.4 and applying Lemma 7.7. ■ 

These two lemmas (Lower bound in expected value and Concentration around mean), combined 
with the union bound give the first part of Theorem 2.9. 



7.4.2 Proof of Theorem 2.9 part (b) 

This part follows from the combination of the proof of Theorem 2.9 part (a) with the bound given 
for the inradius presented in the proof of Theorem 2.8. 



7.5 Proof of Theorem 1.3 

The proof follows Theorem 2.9 with t a small number. Here we use t = 1 — ]=. 

v2 
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